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ABSTRACT 

Motivation: Comparative analyses of gene expression data from 
different species have become an important component of the 
study of molecular evolution. Thus methods are needed to estimate 
evolutionary distances between expression profiles, as well as a 
neutral reference to estimate selective pressure. Divergence between 
expression profiles of homologous genes is often calculated with 
Pearson's or Euclidean distance. Neutral divergence is usually 
inferred from randomized data. Despite being widely used, neither 
of these two steps has been well studied. Here, we analyze these 
methods formally and on real data, highlight their limitations and 
propose improvements. 

Results: It has been demonstrated that Pearson's distance, in 
contrast to Euclidean distance, leads to underestimation of the 
expression similarity between homologous genes with a conserved 
uniform pattern of expression. Here, we first extend this study 
to genes with conserved, but specific pattern of expression. 
Surprisingly, we find that both Pearson's and Euclidean distances 
used as a measure of expression similarity between genes depend 
on the expression specificity of those genes. We also show that 
the Euclidean distance depends strongly on data normalization. 
Next, we show that the randomization procedure that is widely 
used to estimate the rate of neutral evolution is biased when 
broadly expressed genes are abundant in the data. To overcome 
this problem, we propose a novel randomization procedure that is 
unbiased with respect to expression profiles present in the datasets. 
Applying our method to the mouse and human gene expression data 
suggests significant gene expression conservation between these 
species. 
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1 INTRODUCTION 

Changes in gene expression have been suggested to underlie many 
differences in gene function or in phenotype. More generally, 
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expression is an important component of gene function, and studying 
the evolution of gene expression is a key step in evolutionary 
genomics. While there has been a great deal of researc h concerning 
the pri mary treatment of expressi on data in general ( see lGarber et al\ 
l201lh . and lOuackenbiisbl d2002i) for reviews), there has been little 
investigation into th e methods used mo re specifically to quantify 
expression evolution dPereira gf a7.1l2009h . This can make it difficult 
to critically assess contradictory results, such as the reports that 
broad ly expressed gene s are more conserved dKhaitovich et all 
l2005h or less conserved iLiao et a/.Ll201ol:lLiao and ZhangLl2006bf) 
than specifically expressed genes. 

To assess whether and how much expression has been 
conserved between two orthologous genes by selection, we need 
an expectation for expression similarity under neutral evolution. 
Thus, the estimation of gene expression conservation requires two 
components: (i) a measure of gene expression similarity; and (ii) the 
expected value of the divergence level under neutrality. 

The two most common measures of similarity between expression 

S >rofiles of o r thologous genes are Pearson's cor relation coe f ficient 
Chan et al 



or thologous genes are Pearson s cor relation coefficient 
1 1 J2OO9): ILiao and ZhansL l2006aL 15 IXing et a/.L 12007 



lYanai et fl/.L l2004l lYang et all 120051: IZheng-Bradlev et g/lboiO ) 
and Euclidean distance Jjordan et all 120051 ; ILiao and Zhang , 



l2006al : lYanai et all 12004 . The results obtained with Pearson's and 
Euclidean distances have been reported to be poorly correlated f Liao 
and Zhang. l2006al : IPereira et all 120091) . This poses the question 
which of these measures provides a better description of expression 
similarity. It has been demonstrated that Pearson's correlation 
coefficient, in contrast to Euclidean distance, underestimates the 
expression similarity between orthologous genes with a conserved 
uniform pattern of expression. In consequence, use of the Euclidean 
distance has been encourag ed dPereira gf g/ll2009h . 

For neutral evolution, one expects that similarity between 
expression profiles of orthologous genes gradually decreases with 
time. For species that have diverged for sufficiently long time no 
detectable similarity in expression is expected to remain; this has 
been postulat ed to be the case b etween mouse and human (100 
million years; IJordan et a/.ll2005l) . It has been suggested that such 
large neutral divergence could be approximated by calculating the 
distance between expression profiles of randomly chosen pairs of 
genes from the species compared. The standard approach used 
to generate random pairs o f genes is to permute the ortholog 
relatio ns hip between them ( Chan et "all 20091: ILiao and Zhan; 
l2006aL lbl: IXmg et fl/.Ll2007l : IZheng-Bradlev et fl/.Ll201()E ~~ 
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Here, we show formally and empirical l y that, in contrast t o 
previous reports dliao and Zhand. l2006al : [Pereira et a/1 l2009h . 
there exists a relationship between the Pearson's correlation 
coefficient and the Euclidean distance, which depends on the 
data normalization. We also extend the previous study of Pereira 
et dl. (120091) by considering more than just the uniform pattern 
of expression. We demonstrate that in fact both distance measures 
depend on the expression specificity of analyzed genes. Next, we 
discuss these observations in the context of the assessment of gene 
expression conservation. We show that the comparison of expression 
profiles for randomly permuted gene pairs is biased when broadly 
expressed genes are abundant in the data, a distribution characteristic 
of many datasets. To overcome this problem, we propose a novel 
procedure to generate random gene pairs. This procedure is not 
biased by the over- or underrepresentation of any expression profile 
in the datasets. Finally, we use our approach to provide clear 
evidence for constrained evolution of gene expression between 
mouse and human. 



2 METHODS 

2.1 Gene expression data 

We used the huma n and mouse g ene expression data from the GNF Gene 
Expression Atlas oi lSu et <3/.ll2004l) as a case study. This study was performed 
on the Affymetrix HG-U133A array as well as on the custom array GNF1H 
for human, and on the custom array GNF1M for mouse. In total, expression 
profiles for 79 human and 61 mouse organs were measured, with 44928 
probe sets for human and 36 182 probe sets for mouse. We only took into 
account organs belon ging to the homologo us organ groups (HOGs) defined 
in the Bgee database iBastian et <3/Il2008|) . Using the mapping available in 
the Bgee database we could connect 36 human organs and 30 mouse organs 
to 27 HOGs. See Supplementary Table SI for the list of HOGs and their 
correspo nding organs. M icroarray data were normalized with the gcrma R 
package tWu et fl/.ll2004l) . 

To assign the probe sets to their corresponding human or mouse genes we 
used the mapping available in Bgee. We kept only probe sets which matched 
to a unique Ensembl gene. A total of 15 121 probe sets corresponding to 
13 853 mouse genes, and 23 920 probe sets corresponding to 15 338 human 
genes were found. 

To estimate the expected values of distances for gene pairs with conserved 
expression patterns, we used data from replicated experiments, performed 
in each species. Thus, for each probe set we had two vectors of values 
representing its expression over the organs. The datasets contained 36 organs 
and 23 920 probe set pairs for human, and 30 organs and 15121 probe set 
pairs for mouse. The results of the study on mouse gene expression data are 
presented in the Supplementary Materials. 

To study gene expression evolution between mouse and human we merged 
human and mouse organs into 27 HOGs. For every probe set in each 
HOG the arithmetic mean of the gcRMA normalized expression values was 
calculated (each HOG was represented by at least two microarray s). We 
used a subset of 8942 one-to-one orthologous gene pairs (see Human-Mouse 
Orthologous Genes). If the gene was matched by more than one probe set 
on the microarray, we randomly picked one probe set to represent that gene. 



2.2 Human-mouse orthologous genes 

Homology informat ion of human and m ouse genes was retrieved from 
Ensembl release 55 ( iHubbard gf all 12009b . using BioMart ISmedlev et all 
120091) . A total of 8942 pairs of human-mouse one-to-one orthologous genes 
had expression information in the datasets we used. 



2.3 Normalization procedures 

For a given gene we consider a vector x of expression intensities jc/ across n 
different organs indexed by i= 1, ...n. The Manhattan normalization of x is 
calculated by dividing it by its L 1 norm: 



Mli=I>/|. 



In some studies lliao and ZhanJ, l2006i IPereira et all 120091) this 
normalization is called relative abundance. The Euclidean normalization of 
vector x is calculated by dividing the vector by its L 2 norm: 



l|X||2 = 



M i=\ 



Finally, we introduce a so-called z-like normalization of x which corresponds 
to the Euclidean normalization of x minus its mean value: 

x—x 

7x = 



l|X-*|| 2 



2.4 Pearson's and Euclidean distances 

The Pearson's distance (dp) between two expression profiles is defined as 
1 — r, where 



1 ^(xi-x)(yi-y) 
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(1) 



is the Pearson's correlation coefficient between vectors x and y. Here the 
vector elements x; and y; are the expression signal intensities of two genes 
in the condition i, x and y are the sample means, s x and s y are the sample 
SDs. z x and z y are the z-scores of vectors x and y. 

The Euclidean distance (Je) between two expression profiles is defined as 



d E = 



X>-y;) 2 



(2) 



\|/=i 

with notations as for Equation {TJ. 

2.5 Organ specificity of gene expression 

To measure the exp ression specif i city o f human genes we used the organ 
specificity index x lYanai et~all 1 2005b . The r of a given gene with an 
expression vector x is defined as follows: 



E(i-*) 

i=\ 

n-l ' 



(3) 



where 



Hxlloo maxi</< n (x/) 
The value of x varies between 0 and 1, with higher values indicating higher 
organ specificity. 

2.6 t -group composition 

To study the relation between Je and x we used replicated expression data 
for human genes (36 organs, 23 920 probe sets). We sorted the probe set pairs 
according to the organ specificity index x [Equation J3j] of the first replicate, 
and we divided the probe set pairs into three r-groups of equal size (e.g. the 
first group contained 1/3 of the probe set pairs with the first replicate having 
lowest r). For each group we recorded the minimum and maximum r value 
of the first replicate, and used these values to filter out probe sets with the two 
replicates having x values from different groups. The resulting x -groups were 
of similar, but not equal, size (Table[fl. An alternative x -group composition, 
with a more balanced distributions of x values (first group containing genes 
with x e [0,0.2); second group with x e [0.2,0.6); and third group with r e 
[0.6, 1]) leads to unbalanced sizes of three groups. Nevertheless, for both 
approaches the results are qualitatively the same (Supplementary Figs. S6 
and S7). 



1866 



Correcting for the bias in evolutionary study of expression 



2.7 Randomization procedures 

Changes in gene expression patterns between randomly chosen genes from 
two species have been sugg ested as an approxi mation for the result of 
neutral expression evolution jjordan et all 1 20051) . We used two different 
randomization procedures to create such sets of random gene pairs. First, we 
permuted the gene order within replicates (or within species). We refer to 
these as randomly permuted pairs. Second, we performed what we refer to as 
't -uniform sampling' . We first randomly chose an organ specificity index (r), 
uniformly from the interval of (r^n, T max ), where r m i n and r max are the lowest 
and the highest values of the observed t, respectively. Next, we picked the 
gene with the value of t closest to the randomly chosen t within one dataset 
(i.e. within one replicate, or one species). Then, independently, we repeated 
the procedure for the second dataset. Thus, we obtained two randomly chosen 
genes which form a new random pair. Repeating the procedure provides the 
't -uniform' random gene pairs. 



3 RESULTS AND DISCUSSION 

3.1 Correlation between Pearson's and Euclidean 
distances depends on data normalization 

To compare gene expression between species, over many different 
conditions, it is important to normalize the expression levels between 
the conditions to obtain a common scale between species. This is 
distinct from the preprocessing normalization (within condition), 
which is typically done using method s such as LOESS (lYang et ali 
l2002h or gcRMA (IWu et all 120041) . and is not specific to inter- 
species evolutionary studies. In the following, we only consider the 
impact of the between conditions normalization on the evolutionary 
comparisons. We discuss three normalization procedures commonly 
used for evolutionary studies: Man hattan normalizat i on [al so 
referred to as 'relative abundance ' (iLiao and Zhangl l2006ah l. 
Euclidean normalization and z-like normalization (see Section 2.3 
for mathematical definition of all three normalizations). 

One can use any of these normalizations before calculating 
the Pearson's or Euclidean distance between two gene expression 
profiles. However, the choice of normalization can affect the results. 
Pearson's distance (dp) between two expression profiles remains the 
same, regardless of whether and how the data are normalized, and 
it ranges between 0 and 2. The reason is that r is defined on the 
z-scores [see Equation ([T} in Section 2.4], which are invariant with 
respect to linear transformation. In contrast, the Euclidean distance 
between two expression profiles (d^) changes its value depending 
on the normalization used, even though the interval of possible a\ 
values is always between 0 and 2. 

The correlation between dp and dy t is poor for M a nhattan 
(Supplementary Fi g. SI A; see also ILiao and Zhangl l2006al ; 
IPereira et all 120091) and Euclidean normalizations (Supplementary 
Fig. SIB). In contrast, z-like normalization leads to an 
interdependent relationship between dp and d^, defined by 

dl = 2dp (4) 

(see Theoretical Analysis in Supplementary Material, and 
Supplementary Fig. SIC). As dp gives the same results for all 
three normalizations, and for z-like normalization it is equal to 
dg/2, we focused on the Euclidean distance. If not stated otherwise, 
the Euclidean distance was calculated for all three normalizations: 
Manhattan, Euclidean and z-like, referred to as d^f , d^ and d^, 
respectively. 



Table 1. Composition of three r -groups of human probe set (ps) pairs 





Organ specificity (t) 


Number of ps pairs 


t -group 1 


0.003 <t<0. 117 


6348 


t -group 2 


0.117 <t<0.295 


5280 


t -group 3 


0.295 <r<0.879 


6692 



3.2 Commonly used measures of gene expression 
similarity depend on the organ specificity of 
the genes 

Intuitively, one might assume that the distance between two 
orthologous genes which have conserved the expression profile of 
their last common ancestor should be close to zero, and that this 
should hold regardless of the gene expression pattern. To assess if 
this is indeed the case, we performed an empirical study. We used 
human microarray data with the expression infor mation from 36 
different organs in two replicates (ISu et all I2QQ4I) . The replicates 
were used to 'simulate' pairs of genes with conserved expression 
profiles. We calculated the organ specificity index r [Equation 0] 
for each pair of replicates, and then divided them into three r- 
groups of similar size (see Section 2.6 for details). The first two 
groups contained broadly expressed genes (r < 0.295), and only the 
third group consisted of genes with more specific expression patterns 
(r> 0.295; TableHJ. 

We measured the Euclidean distances (d^f , d^ and d^) for probe 
set pairs within each r -group. The resulting levels of expression 
similarity between replicates strongly depended on the organ 
specificity level. Values of d^f and d^ were significantly lower for 
broadly expressed genes than for organ- specific genes (p< 10 -16 , 
Mann-Whitney U test, Fig. QJl and B; Supplementary Fig. S5A 
and B). In contrast, values of d^ were significantly higher for 
broadly expressed genes than for organ- specific genes (p< 10 -16 , 
Mann-Whitney U test; Fig. [if; Supplementary Fig. S5C). See 
Supplementary Figure S3 for the correlation analysis between the 
Euclidean distances and organ specificity index. 

3.3 The rate of neutral expression evolution estimated 
with randomly permuted gene pairs depends on the 
organ specificity of the genes 

The rate of neutral expression evolution is typically approximated 
by calculating the distance between expression profiles of randomly 
paired genes. The random choi ce of the genes is assu med to remove 
any similarity between them jjordan et all 120051) . The standard 
approach to generate random gene pairs is to permute the ortholog 
relationship between the genes in the datasets. We created random 
probe set pairs by permuting the probe set order within each of 
the three r -groups separately, and we then calculated the Euclidean 
distances (d^f , d^ and d^) between their expression profiles. We 
found that d^f and d^ were significantly lower for random pairs 
from the first r -group, than for random pairs from the third r -group 
(p < 10 -16 , Mann- Whitney U test; Fig. QJl and B; Supplementary 
Fig. S5A and B). This is because the first r-group consisted of 
broadly expressed genes. Consequently, even the randomly matched 
probe set pairs tended to have similar expression patterns and thus 
low distances. In contrast, the third r-group consisted of genes with 
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-• — replicates 
— permuted pairs 




0 0.005 0.01 0.015 0 0.02 0.04 0.06 0.08 0 0.5 1 



1.5 




0 0.01 0.02 0.03 0.04 0.05 0 



0.3 0 




Fig. 1. The distribution of expression similarity between human replicates depends on their organ specificity. (A) d^f and (B) are significantly lower 
for broadly expressed genes (group 1) than for organ- specific genes (group 3). For randomly permuted gene pairs d^f and djf also differ between the three 
t -groups. They are significantly lower for random pairs in group 1 than in group 3. (C) d^ is significantly higher for broadly expressed genes (group 1) than 
for organ-specific genes (group 3). d\ for randomly permuted pairs is high in all three groups, even in the first x -group, where random pairs consist of two 
broadly expressed genes (this is a consequence of low r for uniformly expressed genes). Note that the scale of the x-axis differs strongly between graphs. 



1868 



Correcting for the bias in evolutionary study of expression 



more specific expression patterns, and so the random pairs were 
truly different. 

between random pairs was not affected by organ specificity, 
in the sense that in all three r -groups the median was around 1.4 
(Fig. \V[2 ; Supplementary Fig. S5C). Values of were high even in 
the first r -group, although it consisted of random pairs with similar, 
broad patterns of expression. The reason is that d^ = ^2(1 — r) is 
a decreasing function of r, which for broadly expressed gene pairs 
reflects m ainly the noise of the measurement and is close to 0 (for 
details see IPereira et all [20091 and Supplementary Fig. S2). Thus, 
random gene pairs from the first r -group tend to have high d^ values 
(around y/l). 

3.4 A large fraction of broadly expressed genes leads to 
an underestimation of expression conservation 

Our analysis shows that if the fraction of broadly expressed genes 
is large, the level of gene expression conservation is likely to be 
underestimated. This is especially important if we consider the 
fact that housekeeping gen es (broadly expressed) a re more frequent 
than organ- specific genes dRamskold g7a/ll2009h . We found such 
skewed distributions not only in the human data considered here 
(Fig.[3]V), but also in several other datasets, e.g. most mouse genes 
are broadly expressed over different organs, most Arabidopsis genes 
are broadly expressed over different light conditions, and most 
zebrafish genes are broadly expressed over different developmental 
stages (Supplementary Fig. S4). 

To illustrate the extent to which the abundance of broadly 
expressed genes affects measures of gene expression conservation, 
we re-analyzed all the human probe set pairs, without dividing them 
into r -groups. We created random probe set pairs by permuting 
the probe set order within both replicates, and we calculated the 
Euclidean distances (d^f , d^ and d^) both for the pairs of replicates 
and for the random pairs. Ideally, one would expect to detect very 
high similarity between replicates, and very low similarity between 
random pairs. 

For Manhattan and Euclidean normalizations, distances for most 
human random pairs were very small, indistinguishable from 
the distances between replicates (Fig. |2}\ and B; Supplementary 
Fig. S8A and B). This contradicts the assumption that differences 
between randomly paired genes are to approximate well the rate 
of neutra l divergence, with very low similarity (i.e. high distance) 
expected (I Jordan fl/.Ll2005l) . For the z-like normalization, distances 
between random pairs were hig h, which is consisten t with the 
assumption of pseudo-neutrality (Ijordan et all 120051) . However 
the values for the replicates were similarly high (Fig. 
Supplementary Fig. S8C), whereas they are expected to be low. 
Thus, the presence of numerous broadly expressed genes causes 
systematically low values of d^f and d^ between randomly paired 
genes, and systematically high values of d^ between conserved 
gene pairs. The first is a consequence of the fact that it is easier 
to randomly choose two broadly expressed genes, and thus to get a 
low value of d^f or d^ . The second is a consequence of low values 
of r for uniformly expressed genes, leading to the high values of 
(as discussed in the Section 1331 ). In all cases, the level of gene 
expression conservation is underestimated. 

Although we show this effect using a specific set of human 
microarray data, our conclusions are very general and hold for 
any study in which a significant fraction of the genes is uniformly 



expressed over conditions (see Fig. S2 and its caption for a 
mathematical explanation). 

3.5 An alternative construction of random gene pairs 
improves the estimation of expression conservation 

To overcome the limitation of using randomly permuted gene pairs 
to estimate the expression divergence under neutrality, we propose 
a new procedure to create random gene pairs. This procedure 
is unbiased regardless of over- or underrepresentation of any 
expression profiles in the datasets. Consequently, it provides a better 
approximation of the expression divergence under neutral evolution 
between distant species. To generate a single random pair of genes, 
one randomly chooses two expression specificity values, x\ and T2, 
uniformly from the interval of (r m i n , r max ), where r m j n and r max 
are the lowest and the highest values of the observed r, respectively. 
Next, one picks the two genes from the two datasets that have the 
closest x values to x\ and T2, respectively. The resulting pairs of 
genes have the two x values uniformly distributed, and not biased 
as for randomly permuted gene pairs (Fig. 03 and C). 

We applied our new procedure 23 920 times to create as many 
random probe set pairs for human datasets. Then, we calculated 
the Euclidean distances (d^f , d^, and d^) both for replicates and 
random probe set pairs. We found that, relative to classical randomly 
permuted pairs, the distribution of d^ and djf for x -uniform random 
pairs differs strongly from that for replicates (Fig. |5J\ and B), 
with a high frequency of large distance values, as expected for 
very divergent pairs. Of note, d^f and d^ give the same shape of 
distribution (Figs. UK and B, andEK and B). While both of these 
measures could be combined with x -uniform sampling to estimate 
gene expression conservation, for mathematical consistency we 
prefer the use of d^ . 

The estimation of gene expression conservation with d^ cannot 
be corrected by creating the set of random gene pairs differently, 
because d^ varies significantly with organ specificity for replicates, 
i.e. for conserved genes, and not for random gene pairs. Thus, we do 
not recommend using d^ , and consequently the Pearson's correlation 
coefficient, in any study which aims to detect similarity between 
genes expressed uniformly over all conditions. 

Of note, neither the standard procedure used to generate random 
pairs, nor our new proposed approach takes into consideration the 
time passed since the divergence of two organisms. Therefore, the 
estimated 'neutral' divergence will be the same for closely related 
species (e.g. human and chimp) and more distant species (e.g. human 
and mouse). 

3.6 Results of the comparative study of human and 
mouse gene expression differ strongly according 
to the choice of randomization method 

To demonstrate the importance of our novel approach, we 
investigated how much evidence of selectively constrained gene 
expression evolution we can detect between human and mouse. We 
selected 8942 one-t o-one ortholog ous gene pairs from the human 
and mouse datasets et a/.l . 120041) . We created two sets of random 
gene pairs, using both random permutation and the procedure of 
x -uniform sampling, and we calculated the Euclidean distance 
(dj| ) for orthologous gene pairs and for both sets of random pairs 
(see Fig. S9 for analogous analysis with d^f). If the d^ value for 
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-• — replicates 

— permuted pairs 
-■ — T-uniform pairs 




Fig. 2. Overrepresentation of broadly expressed human genes causes underestimation of the conservation of expression when randomly permuted pairs are 
used to approximate the neutral evolution rate. (A, B) For most randomly permuted pairs (grey) the distance (d^f and ) is small, indistinguishable from the 
distances between replicates ( green). For r -uniform random pairs (blue) djf and d^f are higher, which is more consistent with the assumption about neutral 
evolution djordangf a/ll2005b . (C) is high both for randomly permuted gene pairs and for the group of replicates. The distribution of d^ does not change 
with the new random pairs set. 




Fig. 3. Random gene pairs have their x values differently distributed depending on the randomization procedure used. (A) r distribution for human replicates. 
The t pairs are distributed along the diagonal, which is expected for replicates. (B) r distribution for randomly permuted gene pairs. The r pairs are biased 
towards low values, which are the most frequent values in human datasets. (C) r distribution for t -uniform random pairs. The x pairs are uniformly distributed, 
and not biased towards the low values. 



a human-mouse orthologous gene pair is smaller than the fifth 
percentile of for randomly paired genes, there is some evidence 
that the expression evolu tion of this pair has been constrained 
iLiao and Zh<ml l2006al) . Using randomly permuted gene pairs 
did not provide clear evidence for constrained evolution (Fig. |4}. 
Only 8% of orthologous pairs were identified to have a conserved 
expression pattern, which was close to the random expectation of 
5%. In contrast, with T-uniform random pairs, 29% of orthologous 
genes were identified to have conserved expression (Fig. [4}. 

The number of detected genes with conserved expression pattern 
may se em surprisingly low in comparison to ILiao and Zhand 
i2006al) . who reported that as much as 84% of genes showed 
con served expression betwee n human and mouse. However, we note 
that lLiao and Zhangl (l2006ah used two different metrics to calculate 



the distance between orthologous genes and between randomly 
paired genes — the so called net distance and the Euclidean 
distance, respectively. We show that this inconsistency caused an 
overestimation of the expression conservation between human and 
mouse (see Supplementary Materials and Supplementary Fig. S10). 
Consequently, we believe that correcting for the randomization 
process yields more accurate results than a one-sided correction of 
the distance. 

We are aware that the alternative way of creating random gene 
pairs proposed in this article has some weaknesses, such as visible 
artificial peaks in the distribution (Fig. |4}, which are the 
consequence of the non-uniform distribution of x between 0 and 1 . 
This is because with the r -uniform sampling one chooses the genes 
with less frequent r values more often than genes with more frequent 
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Fig. 4. The choice of the randomization method changes the conclusions 
about gene expression evolution between mouse and human. There is no 
clear evidence for constrained evolution if we compare the distribution of djf 
for orthologous (green) and randomly permuted gene pairs (grey). Whereas, 
comparison of djf distribution for orthologous (green) and x -uniform random 
pairs (blue) suggest that expression evolution is far from neutral. 

t values. For example here, the number of narrowly expressed 
genes was increased at the expense of decreasing the number of 
broadly expressed genes. Consequently, when only a few genes 
have a r value in some non-negligible range, these few genes might 
repeat many times in the randomized set, and discrete effects may 
manifest themselves causing artificial peaks. Note that the peaks 
would disappear if r values were uniformly distributed between 0 
and 1, but then there would be no need for r -uniform sampling of 
gene pairs at all. Note also that the peaks do not affect the analysis, 
as they do not change the overall shape of the distribution of distance 
values between the randomized gene pairs (Fig. [4}. 

Finally, one may argue that the r -uniform sampling contradicts 
the very purpose of randomization because it makes a probability 
of choosing a gene higher, if its r value is underrepresented in the 
dataset. But the aim of the set of randomized gene pairs is not to 
be 'just random', but to display maximal divergence between gene 

i > airs, i.e. to simulate the neutral evolution defined in I Jordan et al\ 
2005b . In contrast to the standard approach, the r -uniform sampling 
makes the distribution of distance values between gene pairs actually 
independent of the r distribution observed in the analyzed dataset. 
Thus, we believe that the distance between r -uniform random gene 
pairs approximates better a large neutral divergence. 



4 CONCLUSIONS 

The Euclidean distance should be used with caution as an estimator 
of gene expression conservation because it varies as a function of 
expression specificity. Our results strongly suggest that to assess 
whether gene expression evolves neutrally, one should use 
(Euclidean distance preceded by Euclidean normalization) and 
compare its distribution for orthologous and r -uniform random 
pairs. Importantly, we validated this approach on real data, and 
recovered clear evidence for gene expression conservation between 
mouse and human. Previous small differences reported between real 
and random gene pairs were likely caused by the way the random 



pairs were constructed dLiao and Zha"rill2006aL lbl). Although in this 
study we applied our approach to microarray data analysis, the issues 
highlighted here are also relevant to data acquired with RNA-seq 
technology dMortazavi et a/.L[20Q8h . 

We would like to emphasize that while it is possible to verify 
whether the expression of a given set of genes was under selective 
pressure, there is no straightforward way to compare the strength of 
selection acting on two groups of genes with different expression 
patterns. Indeed, if we compare a group of broadly expressed 
genes with a group of narrowly expressed genes, with similar high 
conservation of expression, the latter will always have higher dj| 
values (and lower values). This methodological problem suggests 
a need to re-interpret results from previous evolutionary studies 
comparing the evolution of broadly and narrowly expressed genes. 
In particular, studies which have reported higher conservation of 
organ- specific genes Jliao et all l2010l : iLiao and Zhangj l2006bl : 
iMovahedi et all l201ll) could have been biased by the fact of using 
the Pearson's correlation coefficient (equivalent to d^) as a measure 
of conservation. 

In this article, we thoroughly analyzed, formally and experi- 
mentally, the common measures of expression conservation, and 
we showed the superiority of the Euclidean distance paired with 
the Euclidean normalization. We also highlighted the limitation of 
using randomly permuted pairs to approximate neutrally evolving 
genes, and proposed a new methodology to better estimate the rate 
of neutral evolution. With the increase of expression data for many 
species, our work is likely to become very useful for evolutionary 
studies of gene expression. 
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